Library Imports
from pyspark.sql import SparkSession
from pyspark.sql import types as T
from pyspark.sql import functions as F
from datetime import datetime
from decimal import Decimal
Template
spark = (
SparkSession.builder
.master("local")
.appName("Section 2 - Performing your First Transformations")
.config("spark.some.config.option", "some-value")
.getOrCreate()
)
sc = spark.sparkContext
import os
data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
|   | id | breed_id | nickname | birthday            | age | color |
|---|----|----------|----------|---------------------|-----|-------|
| 0 | 1  | 1        | King     | 2014-11-22 12:30:31 | 5   | brown |
| 1 | 2  | 3        | Argus    | 2016-11-22 10:05:10 | 10  | None  |
| 2 | 3  | 1        | Chewie   | 2016-11-22 10:05:10 | 15  | None  |
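Note that `spark.read.csv` with only `header=True` reads every column as a string. As a side note, a minimal sketch of supplying an explicit schema instead (the column names and types here are assumptions based on the data shown above) could look like this:

pets_schema = T.StructType([
    T.StructField('id', T.IntegerType(), True),
    T.StructField('breed_id', T.IntegerType(), True),
    T.StructField('nickname', T.StringType(), True),
    T.StructField('birthday', T.TimestampType(), True),
    T.StructField('age', T.IntegerType(), True),
    T.StructField('color', T.StringType(), True),
])

# read the same file, but with typed columns instead of all strings
pets_typed = spark.read.csv(path, header=True, schema=pets_schema)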
Transformation
(
pets
.withColumn('birthday_date', F.col('birthday').cast('date'))
.withColumn('owned_by', F.lit('me'))
.withColumnRenamed('id', 'pet_id')
.where(F.col('birthday_date') > datetime(2015,1,1))
).toPandas()
|   | pet_id | breed_id | nickname | birthday            | age | color | birthday_date | owned_by |
|---|--------|----------|----------|---------------------|-----|-------|---------------|----------|
| 0 | 2      | 3        | Argus    | 2016-11-22 10:05:10 | 10  | None  | 2016-11-22    | me       |
| 1 | 3      | 1        | Chewie   | 2016-11-22 10:05:10 | 15  | None  | 2016-11-22    | me       |
What Happened?
- We renamed the primary key of our df from `id` to `pet_id`.
- We truncated the precision of our date types (timestamp to date).
- We filtered our dataset to a smaller subset (pets born after 2015-01-01).
- We created a new column describing who owns these pets (each of these steps is sketched individually below).
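For reference, here is a sketch of the same chain broken into one transformation per statement, using the `pets` DataFrame read in above; the intermediate variable names are chosen purely for illustration:

pets_dated = pets.withColumn('birthday_date', F.col('birthday').cast('date'))    # timestamp -> date
pets_owned = pets_dated.withColumn('owned_by', F.lit('me'))                      # add a constant column
pets_renamed = pets_owned.withColumnRenamed('id', 'pet_id')                      # rename the primary key
pets_recent = pets_renamed.where(F.col('birthday_date') > datetime(2015, 1, 1))  # keep pets born after 2015-01-01
pets_recent.toPandas()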
Summary
We performed a variety of Spark transformations on our data; we will go through these transformations in detail in the following section.